14. Value Iteration Algorithm

The process that we went through to determine the optimal policy for the mountainous environment was fairly straightforward, but it did take some intuition to identify which action was optimal for every state. In larger, more complex environments, intuition may not be sufficient. In such environments, an algorithm is needed to handle the computation and find the optimal solution to the MDP. One such algorithm is the Value Iteration algorithm. Iteration is a key word here, and you’ll see just why!

The Value Iteration algorithm will initialize all state utilities to some arbitrary value - say, zero. Then, it will iteratively calculate a more accurate utility for each state using the update U(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\, U(s').
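
As a concrete sketch (not from the original text), here is one way this single-state update might look in Python, assuming the rewards R, transition model T, utilities U, and available actions are stored in plain dictionaries; the function name bellman_update and these data structures are illustrative assumptions.

# Illustrative sketch with assumed data structures:
#   R: dict mapping state -> reward
#   T: dict mapping (state, action) -> list of (next_state, probability) pairs
#   U: dict mapping state -> current utility estimate
#   actions: dict mapping state -> list of available actions
def bellman_update(s, R, T, U, actions, gamma):
    # R(s) + gamma * max over actions of the expected utility of the successors
    return R[s] + gamma * max(
        sum(p * U[s_next] for s_next, p in T[(s, a)])
        for a in actions[s]
    )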

Algorithm

U' = 0

\text{repeat:}

\quad U = U'

\quad \text{for } s \text{ in } S\text{, do:}

\quad\quad U'(s) = R(s) + \gamma \max_a \sum_{s'} T(s,a,s')\, U(s')

\text{until } \textit{close-enough}(U,U')

\text{return } U
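
As a hedged sketch, the loop above might be written in Python as follows, reusing the bellman_update helper from earlier and a close_enough predicate (one possible definition is given below); the dictionary-based MDP representation is again an assumption made for illustration.

def value_iteration(S, R, T, actions, gamma, close_enough):
    # U' = 0: start all utility estimates at an arbitrary value (zero here)
    U_new = {s: 0.0 for s in S}
    while True:
        # U = U': freeze the estimates from the previous sweep
        U = dict(U_new)
        # update every state's utility from the frozen estimates
        for s in S:
            U_new[s] = bellman_update(s, R, T, U, actions, gamma)
        # stop once successive estimates agree closely enough
        if close_enough(U, U_new):
            return U_new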

With every iteration, the algorithm obtains a more accurate estimate of each state’s utility. The number of iterations is dictated by a function \textit{close-enough}, which detects convergence. One way to implement this check is to evaluate the root mean square (RMS) error between successive utility estimates,

RMS(U,U') = \sqrt{\frac{1}{|S|}\sum_{s}\left(U(s) - U'(s)\right)^2}

Once this error falls below a predetermined threshold \epsilon, the result has converged sufficiently.

RMS(U,U') < \epsilon
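
One possible close_enough predicate based on this RMS test might look like the following in Python, with epsilon as the user-chosen threshold (both the name and the default value are illustrative).

import math

def close_enough(U, U_new, epsilon=1e-6):
    # RMS difference between successive utility estimates
    rms = math.sqrt(sum((U[s] - U_new[s]) ** 2 for s in U) / len(U))
    return rms < epsilon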

This algorithm converges to the optimal utilities, and therefore the optimal policy, for the MDP regardless of how U' is initialized (although a poor initial U' will slow convergence).
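
For instance, the sketches above could be exercised on a tiny, made-up two-state MDP; the states, rewards, and transition probabilities below are purely illustrative.

S = ["a", "b"]
R = {"a": -0.1, "b": 1.0}
actions = {"a": ["stay", "go"], "b": ["stay"]}
T = {
    ("a", "stay"): [("a", 1.0)],
    ("a", "go"):   [("b", 0.8), ("a", 0.2)],
    ("b", "stay"): [("b", 1.0)],
}

U = value_iteration(S, R, T, actions, gamma=0.9, close_enough=close_enough)
print(U)  # the utilities converge to the same values regardless of the initial U'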